The capstone begins where every real engagement begins — by deciding exactly what you're evaluating, and what "safe enough" means
Day 56 of 60
For eleven weeks you've built artifacts: a threat model, a taxonomy and policy, a red-team plan, an eval suite, a robustness report, an alignment and interpretability brief, a risk register, a governance gap list. Each was real and each stood alone. This week you do the thing that turns a collection of skills into a safety lead: you assemble them into one coherent safety evaluation program, a binder a review board could read and act on.
A pile of strong artifacts is not a program. A program is artifacts tied to a single deployment, a single definition of "safe enough," and a single recommendation. The integration, making them point the same direction and tell one story, is the work this week, and it's the work that gets you hired.
And like every real engagement, it starts not with testing but with scoping. Before you evaluate anything you have to say precisely what you're evaluating, who could be harmed, and what bar the system must clear to ship. Get the scope wrong and every downstream artifact measures the wrong thing beautifully.
The subject of the evaluation is not "a model" but a deployment: which model, in which product, with which capabilities and access. "Model X as a tool-using assistant inside a customer-support app, with retrieval over our docs" is a scope. "Is the model safe?" is not. Capabilities and access (tools, browsing, memory, code execution) are part of the subject because they're where most real risk lives.
One deployment's acceptable risk is another's red line. Write the goal as a falsifiable bar: harmful-compliance rate below a threshold, no untested high-severity attack class, over-refusal under a ceiling. The goal is what your evals later either clear or don't — so it has to be decided before you see results, not after.
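One way to make the bar concrete is to write it down as data before any eval runs, then check results against it mechanically. The sketch below is a minimal illustration, not a prescribed format: the class names, field names, attack-class labels, and every number in it are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyBar:
    """A pre-registered definition of "safe enough". Frozen so it cannot be edited after results arrive."""
    max_harmful_compliance_rate: float  # harmful-compliance rate must be below this
    max_over_refusal_rate: float        # over-refusal rate must be below this ceiling
    required_attack_classes: frozenset  # high-severity classes that must be tested


@dataclass(frozen=True)
class EvalResults:
    harmful_compliance_rate: float
    over_refusal_rate: float
    tested_attack_classes: frozenset


def check_bar(bar: SafetyBar, results: EvalResults) -> list:
    """Return the failed criteria. An empty list means the bar is cleared."""
    failures = []
    if results.harmful_compliance_rate >= bar.max_harmful_compliance_rate:
        failures.append("harmful-compliance rate not below threshold")
    if results.over_refusal_rate >= bar.max_over_refusal_rate:
        failures.append("over-refusal rate not below ceiling")
    untested = bar.required_attack_classes - results.tested_attack_classes
    if untested:
        failures.append(
            "untested high-severity attack classes: " + ", ".join(sorted(untested))
        )
    return failures


# Illustrative numbers only; a real bar comes from the deployment's risk tolerance.
bar = SafetyBar(
    max_harmful_compliance_rate=0.01,
    max_over_refusal_rate=0.05,
    required_attack_classes=frozenset({"prompt_injection", "tool_misuse"}),
)
results = EvalResults(
    harmful_compliance_rate=0.004,
    over_refusal_rate=0.08,
    tested_attack_classes=frozenset({"prompt_injection"}),
)
failures = check_bar(bar, results)
for failure in failures:
    print("FAIL:", failure)
```

With these made-up inputs the check reports two failures: over-refusal is above the ceiling, and one required attack class was never tested. The point of the frozen dataclass is the same as the point of pre-registration: the bar is fixed before the results exist.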
Reuse your Week 1 threat model: who and what are you protecting, and who owns the go/no-go? A scope names the review board, the decision date, and the person accountable for the call. A program with no named owner is a document, not a decision.
You already have a threat model from Week 1. Don't write a new one — instantiate it for this specific deployment. Scoping is mostly the act of taking general artifacts and binding them to one concrete system, with one timeline and one decision-maker.
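If it helps to see the scope as a single record, here is a rough sketch that binds the subject, the review board, the decision date, and the accountable owner together and flags whatever is still unnamed. The `Scope` class, its field names, and the `scope_gaps` helper are hypothetical, invented for this illustration; the example values echo the "Model X" scope above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Scope:
    """One deployment, one timeline, one decision-maker."""
    model: str
    product: str
    capabilities: list                       # tools, browsing, memory, code execution
    review_board: Optional[str] = None
    decision_date: Optional[date] = None
    accountable_owner: Optional[str] = None


def scope_gaps(scope: Scope) -> list:
    """List the scope fields that are still unfilled."""
    gaps = []
    if not scope.capabilities:
        gaps.append("capabilities")
    for field_name in ("review_board", "decision_date", "accountable_owner"):
        if getattr(scope, field_name) is None:
            gaps.append(field_name)
    return gaps


# A draft scope with the subject fixed but no date or owner yet.
draft = Scope(
    model="Model X",
    product="customer-support app",
    capabilities=["tool use", "retrieval over internal docs"],
    review_board="safety review board",  # placeholder name
)
gaps = scope_gaps(draft)
print(gaps)  # ['decision_date', 'accountable_owner']
```

A draft that still returns gaps here is the "document, not a decision" case: the subject is defined, but nobody is yet on the hook for the call or the date.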
The fastest way to scope is to lay out the binder's table of contents and write, next to each section, which artifact fills it. This turns the capstone from "write a huge document" into "assemble eight things I already built." It also surfaces gaps early: if a section has no artifact behind it, that's the work the rest of the week closes.
Threat model (W1) → taxonomy + policy (W2) → red-team plan (W3) → eval suite (W4) → robustness report (W5) → alignment + interpretability brief (W7–8) → risk register (W10) → governance gap list (W11) → recommendation. Tomorrow you start bolting them together; today you decide what they're all in service of.
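The table-of-contents exercise can be sketched in a few lines: list the binder sections in order, map each to the artifact that fills it, and print whatever has nothing behind it. The section list mirrors the chain above; the file paths and the `find_gaps` helper are hypothetical, made up for the example.

```python
# Binder sections in reading order, with the week each artifact came from.
BINDER_SECTIONS = [
    ("Threat model", "W1"),
    ("Taxonomy + policy", "W2"),
    ("Red-team plan", "W3"),
    ("Eval suite", "W4"),
    ("Robustness report", "W5"),
    ("Alignment + interpretability brief", "W7-8"),
    ("Risk register", "W10"),
    ("Governance gap list", "W11"),
    ("Recommendation", None),  # written this week, not reused
]


def find_gaps(sections, artifacts):
    """Return section names with no artifact behind them, in binder order."""
    return [name for name, _week in sections if not artifacts.get(name)]


# Hypothetical inventory: section name -> path of the artifact that fills it.
artifacts = {
    "Threat model": "binder/threat_model.md",
    "Taxonomy + policy": "binder/taxonomy_policy.md",
    "Red-team plan": "binder/red_team_plan.md",
    "Eval suite": "binder/eval_suite.md",
    "Alignment + interpretability brief": "binder/alignment_brief.md",
    "Risk register": "binder/risk_register.md",
    "Governance gap list": "binder/governance_gaps.md",
}

missing = find_gaps(BINDER_SECTIONS, artifacts)
for name in missing:
    print("No artifact yet for:", name)
```

In this made-up inventory the robustness report and the recommendation come back as gaps, which is exactly the to-do list for the rest of the week.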
A junior practitioner opens a safety engagement by testing. An expert opens by scoping — fixing the subject, the bar, and the decision-maker first — because a test against an undefined bar can't pass or fail, only flatter. The altitude jump is from "I ran evaluations" to "I ran a program against a pre-registered definition of safe enough, with a named owner and a decision date."
Say this in an interview: "Before I evaluate anything, I scope: which deployment, with which capabilities, against what falsifiable bar, owned by whom, decided by when. Pre-registering 'safe enough' is what keeps the eval honest — otherwise you're just grading the model on a curve you drew after seeing its answers."